import pandas as pd
kakamana
March 19, 2023
This post covers the basics of feature engineering and how to use it. We'll load, explore, and visualize a survey response dataset, see what types of data it contains and why those types influence how you build your feature set, and make new features from both categorical and continuous columns with the pandas package.
This *Creating features* post is part of the DataCamp course: Feature Engineering for Machine Learning in Python.
This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program, Master of Applied Data Science (MADS) at the University of Michigan, DeepLearning.AI, Coursera & DataCamp. You can find similar articles & more stories on my Medium & LinkedIn profiles. I am also on Kaggle & GitHub (blogs & repos). Thank you for your motivation, support & valuable feedback.
These include projects, coursework & notebooks from my data science journey. They are created for reproducibility & future reference only. All source code, slides, and screenshots are the intellectual property of their respective content authors. If you find this content beneficial, kindly consider a learning subscription from DeepLearning.AI, Coursera, or DataCamp.
* Different types of data:
* Continuous: either integers (or whole numbers) or floats (decimals)
* Categorical: one of a limited set of values, e.g., gender, country of birth
* Ordinal: ranked values often with no details of distance between them
* Boolean: True/False values
* Datetime: dates and times
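These types map onto pandas dtypes. A minimal sketch of inspecting them (the column values here are illustrative, mirroring the survey columns below):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [21, 38],                     # continuous (integer)
    'ConvertedSalary': [70841.0, None],  # continuous (float)
    'Hobby': ['Yes', 'No'],              # categorical / boolean-like (object)
    'SurveyDate': pd.to_datetime(['2018-02-28', '2018-06-28']),  # datetime
})

# Inspect the dtype pandas inferred for each column
print(df.dtypes)
```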
| | SurveyDate | FormalEducation | ConvertedSalary | Hobby | Country | StackOverflowJobsRecommend | VersionControl | Age | Years Experience | Gender | RawSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/28/18 20:20 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | Yes | South Africa | NaN | Git | 21 | 13 | Male | NaN |
| 1 | 6/28/18 13:26 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 70841.0 | Yes | Sweeden | 7.0 | Git;Subversion | 38 | 9 | Male | 70,841.00 |
| 2 | 6/6/18 3:37 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | No | Sweeden | 8.0 | Git | 45 | 11 | NaN | NaN |
| 3 | 5/9/18 1:06 | Some college/university study without earning ... | 21426.0 | Yes | Sweeden | NaN | Zip file back-ups | 46 | 12 | Male | 21,426.00 |
| 4 | 4/12/18 22:41 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 41671.0 | Yes | UK | 8.0 | Git | 39 | 7 | Male | £41,671.00 |
Datasets often have columns with multiple data types (like the one you’re working with). Most machine learning models require a consistent data type across features. Most feature engineering techniques only work with one type of data at a time. When working with DataFrames, you’ll often want to access just certain types of columns.
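Selecting only the numeric columns can be sketched with `select_dtypes` (here on a hypothetical miniature version of the survey data, so the example is self-contained):

```python
import pandas as pd

# Hypothetical miniature version of the survey data
so_survey_df = pd.DataFrame({
    'ConvertedSalary': [None, 70841.0],
    'Hobby': ['Yes', 'Yes'],
    'StackOverflowJobsRecommend': [None, 7.0],
    'Age': [21, 38],
    'Years Experience': [13, 9],
})

# Keep only the numeric (int/float) columns
numeric_df = so_survey_df.select_dtypes(include='number')
print(numeric_df.columns)
```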
Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
'Years Experience'],
dtype='object')
Encoding categorical features
One-hot encoding
Dummy encoding
One-hot vs. dummies
One-hot encoding: Explainable features
Dummy encoding: Necessary information without duplication
One-hot encoding and dummy variables
To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding and compare the resulting column sets.
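Both encodings come from `pd.get_dummies`; passing `drop_first=True` switches from one-hot to dummy encoding. A minimal sketch (the toy `Country` values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Sweeden', 'UK', 'India', 'Sweeden']})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df, columns=['Country'], prefix='OH')

# Dummy encoding: drops the first category, since it is implied
# when all the other indicator columns are zero
dummy = pd.get_dummies(df, columns=['Country'], drop_first=True, prefix='DM')

print(one_hot.columns)  # OH_India, OH_Sweeden, OH_UK
print(dummy.columns)    # DM_Sweeden, DM_UK (India is the dropped baseline)
```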
Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
'StackOverflowJobsRecommend', 'VersionControl', 'Age',
'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
'OH_UK', 'OH_USA', 'OH_Ukraine'],
dtype='object')
Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
'StackOverflowJobsRecommend', 'VersionControl', 'Age',
'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
'DM_USA', 'DM_Ukraine'],
dtype='object')
There can be a lot of different categories for some features, and they're rarely evenly distributed. For instance, data science's favorite languages include Python, R, and Julia, but some people have bespoke choices like FORTRAN or C. You might not want to create a feature for every value, just for the ones that show up most often.
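One common approach, sketched below on a toy series: count the values, build a mask of the rows whose category occurs fewer than some threshold number of times, and relabel those rows as 'Other' (the threshold and data here are illustrative):

```python
import pandas as pd

countries = pd.Series(['USA', 'USA', 'USA', 'UK', 'UK', 'Ireland'])

# Count how often each category occurs
counts = countries.value_counts()

# Mask rows whose category occurs fewer than 2 times
mask = countries.isin(counts[counts < 2].index)

# Relabel the infrequent categories
countries[mask] = 'Other'
print(countries.value_counts())
```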
South Africa 166
USA 164
Spain 134
Sweeden 119
France 115
Russia 97
UK 95
India 95
Ukraine 9
Ireland 5
Name: Country, dtype: int64
0 False
1 False
2 False
3 False
4 False
Name: Country, dtype: bool
South Africa 166
USA 164
Spain 134
Sweeden 119
France 115
Russia 97
UK 95
India 95
Other 14
Name: Country, dtype: int64
C:\Users\dghr201\AppData\Local\Temp\ipykernel_37200\753486482.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
countries[mask] = 'Other'
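The warning above appears because `countries` was sliced out of the original DataFrame, so assigning through the slice may not propagate back. A safer sketch is to assign through `.loc` on the DataFrame itself (the miniature `so_survey_df` and the threshold here are illustrative):

```python
import pandas as pd

so_survey_df = pd.DataFrame({'Country': ['USA', 'USA', 'UK', 'Ireland']})

counts = so_survey_df['Country'].value_counts()
mask = so_survey_df['Country'].isin(counts[counts < 2].index)

# Assign on the original DataFrame via .loc -- no SettingWithCopyWarning
so_survey_df.loc[mask, 'Country'] = 'Other'
print(so_survey_df['Country'].tolist())  # ['USA', 'USA', 'Other', 'Other']
```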
Binarizing columns
Even though numeric values can often be used without feature engineering, there will be times when manipulation can be useful. For example, sometimes you don’t care about the magnitude of a value, just its direction, or even if it exists. You’ll want to binarize a column in these cases. The so_survey_df data has a lot of survey respondents who are working for free (without pay). Adding a new column titled Paid_Job will let you know whether each person is paid (their salary is greater than zero).
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0
# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())
Paid_Job ConvertedSalary
0 0 NaN
1 1 70841.0
2 0 NaN
3 1 21426.0
4 1 41671.0
You don’t really care about the exact value of a numeric column, but rather the bucket it falls into. You can use this when plotting values or simplifying machine learning models. Most of the time, it’s used on continuous variables where accuracy isn’t as important e.g. age, height, wages.
equal_binned ConvertedSalary
0 NaN NaN
1 (-2000.0, 400000.0] 70841.0
2 NaN NaN
3 (-2000.0, 400000.0] 21426.0
4 (-2000.0, 400000.0] 41671.0
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
                                         bins=bins, labels=labels)
# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
boundary_binned ConvertedSalary
0 NaN NaN
1 Medium 70841.0
2 NaN NaN
3 Low 21426.0
4 Low 41671.0